\(~\)
\(~\)
This week, we will introduce how to use visualization to observe patterns in our data. There are two major sets of tools for creating plots in R: base R graphics, which come with every R installation, and ggplot2, which is part of the tidyverse.
\(~\)
We will be focusing on ggplot2 in our class.
\(~\)
Research methods classes generally teach important skills such as probability and statistical theory, linear regressions, maximum likelihood estimation (MLE), machine learning, etc. While these are important methods for analyzing data and assessing research questions, sometimes drawing a picture (a.k.a. visualization) should be a first step and can be even more precise than conventional statistical computations.
\(~\)
Okay, let’s get started!
\(~\)
\(~\)
For the following examples, we will be using the gapminder dataset. Gapminder is a country-year dataset with information on life expectancy, among other things.
\(~\)
If you have not already installed the gapminder package
and you try to load it using the following code, you will get an
error:
\(~\)
library(gapminder)
Error in library(gapminder) : there is no package called ‘gapminder’
\(~\)
If this happens, install the gapminder package by
running install.packages("gapminder") in your console.
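A minimal sketch of one way to avoid the error above: install the package only when it is missing, then load it. It uses base R's requireNamespace(). Which CRAN mirror install.packages() uses depends on how your R session is configured, so treat that part as an assumption about your setup.

```r
# Install gapminder only if it is not already available, then load it.
if (!requireNamespace("gapminder", quietly = TRUE)) {
  install.packages("gapminder")
}
library(gapminder)
```

Running this at the top of a script means the same file works on a fresh machine and on one where the package is already installed.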
\(~\)
Once you’ve done this, run the following code to load the
gapminder dataset and the tidyverse library,
which includes ggplot2:
\(~\)
library(tidyverse)
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.0.2
gap <- gapminder
head(gap)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
\(~\)
\(~\)
Once you load the data, based on what we’ve learned in previous classes, discuss the following questions within your group.
\(~\)
(Hint: You can also run ?gapminder in the console to
open the help file for the data and definitions for each of the
columns.)
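As a starting point for the group discussion, here is a short sketch of base R commands that describe the shape and content of the data. Which commands are most useful depends on the question you are asking; this is only one possible selection.

```r
library(gapminder)

gap <- gapminder

dim(gap)                      # number of rows and columns
names(gap)                    # column names
str(gap)                      # type of each column
length(unique(gap$country))   # how many countries are covered
range(gap$year)               # first and last year in the data
table(gap$continent)          # number of rows per continent
summary(gap$lifeExp)          # distribution of life expectancy
```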
\(~\)
\(~\)
The general call for ggplot2 looks like this:
\(~\)
ggplot(data =, aes(x = , y = )) +
geom_xxxx() +
geom_yyyy()
\(~\)
The grammar involves some basic components:
\(~\)
The key to understanding ggplot2 is thinking about a
figure in layers, just as you might in an image
editing program like Photoshop.
\(~\)
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
\(~\)
So the first thing we do is call the ggplot function.
This function lets R know that we’re creating a new plot, and any of the
arguments we give the ggplot function are the global
options for the plot: they apply to all layers on the plot.
\(~\)
For the second argument we passed in the aes function,
which tells ggplot how variables in the data map to
aesthetic properties of the figure, in this case the x and y locations.
Here we told ggplot we want to plot the
gdpPercap column of the gapminder data frame on the x-axis,
and the lifeExp column on the y-axis.
\(~\)
Notice that we didn’t need to explicitly pass aes these columns (e.g., x = gap$gdpPercap). This is because ggplot is smart enough to look in the data for that column!
\(~\)
Then, we need to tell ggplot how we want to visually represent the
data, which we do by adding a new geom layer. In our
example, we used geom_point, which tells ggplot we want to
visually represent the relationship between x and y as a scatterplot of
points:
\(~\)
IMPORTANT: In ggplot, you are adding
layers, so you should use + to separate each line of
code!
\(~\)
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
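To make the idea of layers concrete, the sketch below builds the same scatterplot in two steps and stores each step in an object. The object names base_plot and scatter are chosen here for illustration only. It also shows where the + has to go.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Step 1: the base plot holds the data and the global aesthetic mapping.
# Printing it now would show empty axes, because no geom has been added.
base_plot <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp))

# Step 2: add a layer. The + sits at the END of the line so that R
# knows the expression continues on the next line.
scatter <- base_plot +
  geom_point()

scatter  # print the finished plot

# This would NOT work: R treats the first line as a complete plot
# and then fails on the line that starts with +.
# base_plot
# + geom_point()
```

Storing the base plot in an object is handy when you want to try several geoms on the same data without retyping the ggplot() call.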
\(~\)
\(~\)
\(~\)
aes\(~\)
In the previous examples and challenge we’ve used the
aes function to tell the scatterplot geom
about the x and y locations of each
point. Another aesthetic property we can modify is the point
color.
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point()
Next, we can add a layer to set the colors manually. You can search online for the R color palette to find the names and codes of the available colors.
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point() +
scale_color_manual(values = c("gold", "lightblue", "red", "lightgreen", "pink"))
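When colors are given as an unnamed vector, they are matched to the continents in the order of the factor levels. A small variation, sketched below, names each color so the pairing is explicit. The vector name continent_colors is made up for this example.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Naming each color by continent makes the pairing explicit, so it no
# longer depends on the order of the factor levels.
continent_colors <- c(
  Africa   = "gold",
  Americas = "lightblue",
  Asia     = "red",
  Europe   = "lightgreen",
  Oceania  = "pink"
)

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_color_manual(values = continent_colors)
```

You can run colors() in the console to list every color name that base R recognizes.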
Furthermore, you can modify the opacity of the points by setting
alpha in geom_point.
alpha ranges from 0 (fully transparent) to 1 (fully opaque).
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.5)
Color isn’t the only aesthetic we can use to display variation in the data. We can also vary shape, size, etc. For example, we can map shape to continent as well.
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent, shape = continent)) +
geom_point(alpha = 0.5)
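A common source of confusion is the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). The sketch below puts the two side by side; the object names mapped and fixed and the color "steelblue" are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Mapped: color varies with the data, so it goes inside aes().
mapped <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5)

# Set: every point gets the same fixed color, so it goes outside aes().
fixed <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "steelblue", alpha = 0.5)

mapped
fixed
```

Only the mapped version produces a legend, because only there does color carry information about the data.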
\(~\)
\(~\)
In the previous challenge, you plotted lifeExp over time.
Using a scatterplot probably isn’t the best for visualising change over
time. Instead, let’s tell ggplot to visualise the data as a
line plot:
ggplot(data = gap, aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line()
Instead of adding a geom_point layer, we’ve added a
geom_line layer. We’ve also added the group aesthetic,
which tells ggplot to draw a separate line for each country.
\(~\)
But what if we want to visualize both lines and points on the plot? We can simply add another layer to the plot:
ggplot(data = gap, aes(x = year, y = lifeExp, group = country, color = continent)) +
geom_line() +
geom_point()
It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s another demonstration:
ggplot(data = gap, aes(x = year, y = lifeExp, group = country)) +
geom_line(aes(color = continent)) +
geom_point()
In this example, the aesthetic mapping of color has
been moved from the global plot options in ggplot to the
geom_line layer, so it no longer applies to the points. Now
we can clearly see that the points are drawn on top of the lines.
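Drawing order is easier to see on a small subset. The sketch below keeps only the two Oceania countries and swaps the order of the two layers. It uses group, the grouping aesthetic documented by ggplot2; the object names are made up for this example.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Keep only Oceania (Australia and New Zealand) so the overlap is visible.
oceania <- gap[gap$continent == "Oceania", ]

# Points first, lines second: the colored lines are drawn over the points.
lines_on_top <- ggplot(data = oceania, aes(x = year, y = lifeExp, group = country)) +
  geom_point(size = 3) +
  geom_line(aes(color = country))

# Lines first, points second: the black points cover the lines.
points_on_top <- ggplot(data = oceania, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(color = country)) +
  geom_point(size = 3)

lines_on_top
points_on_top
```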
\(~\)
\(~\)
\(~\)
\(~\)
Labels are considered to be their own layers in ggplot.
You can use labs(x = , y = , title = ) to set your
labels.
# add x and y axis labels
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color=continent)) +
geom_point(alpha = 0.5) +
labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent")
You can also modify the theme of your plots. The themes in ggplot
include theme_bw(), theme_classic(),
theme_light(), theme_void(), etc.
# add labels, then apply the black-and-white theme
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent") +
theme_bw()
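labs() accepts more than axis labels and a title. As a small sketch, the example below also renames the legend (through the name of the mapped aesthetic, color) and adds a caption. The caption text is only an example wording.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

labelled <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "GDP per capita (in US$)",
    y = "Life Expectancy (in years)",
    color = "Continent",                 # legend title for the color aesthetic
    caption = "Data: gapminder package"  # small note below the plot
  ) +
  theme_classic()

labelled
```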
\(~\)
\(~\)
\(~\)
\(~\)
In ggplot, we can change the scale of units on the
x-axis using the scale functions. These control the mapping between the
data values and visual values of an aesthetic.
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
scale_x_log10() + # put the x-axis on a log10 scale
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent")
We can also apply the transformation ourselves in the global aesthetic setting. Note that log() takes the natural log rather than log10, so the x-axis then shows logged values instead of dollar amounts. For example,
# Here I take the natural log transformation on GDP per capita
ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent")
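The two approaches use different log bases, which is easy to overlook. The sketch below checks the difference on three made-up GDP values (the vector gdp is illustrative, not taken from the data) and then shows scale_x_log10() with hand-picked breaks, which are also just one possible choice.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

gdp <- c(100, 1000, 10000)   # illustrative values, not from the data
log10(gdp)                   # base-10 log: 2, 3, 4
log(gdp)                     # natural log: different numbers, same shape
log(gdp) / log(10)           # dividing by log(10) converts to base 10

# scale_x_log10() transforms the axis but keeps the labels in dollars.
log_axis <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  scale_x_log10(breaks = c(300, 1000, 3000, 10000, 30000, 100000))

log_axis
```

Because the shape of the point cloud is the same under either base, the choice mostly affects how readable the axis is.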
\(~\)
ggplot also provides us several useful statistical
tools. One of the most useful tools is the smooth line,
which draws regression lines for us. We can fit a simple relationship to
the data by adding another layer, geom_smooth:
ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent")
## `geom_smooth()` using formula = 'y ~ x'
Note that we have 5 lines, one for each continent, because the color option is in the global aes function. But if we move it, we get different results:
ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(aes(color = continent), alpha = 0.5) +
geom_smooth(method = "lm") +
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent")
So, there are two places where an aesthetic mapping can be specified. Here, we mapped
the color aesthetic inside aes() in the
geom_point layer, so it applies only to the points. Previously, we used the aes
function in the global setting, where it applies to every layer.
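One way to check how many regression lines a plot contains, without relying on the picture, is to inspect the data ggplot2 computes for the smooth layer. The sketch below uses layer_data() for that. The object names are invented here, and formula = y ~ x is passed explicitly only to silence the informational message.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# color in the global aes(): inherited by both the points and the smooth.
global_color <- ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# color only in geom_point(): the smooth sees no grouping variable.
layer_color <- ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is geom_smooth; each distinct group is one fitted line.
n_lines_global <- length(unique(layer_data(global_color, 2)$group))
n_lines_layer  <- length(unique(layer_data(layer_color, 2)$group))

n_lines_global  # 5: one line per continent
n_lines_layer   # 1: a single line for all countries
```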
We can make the line thicker and change its color by setting linewidth and color in the geom_smooth layer (since ggplot2 3.4.0, linewidth replaces size for lines):
ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(aes(color = continent), alpha = 0.5) +
geom_smooth(method = "lm", linewidth = 2, color = "red") +
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, by Continent")
\(~\)
Lastly, you can use the dplyr functions that we learned
last week to choose the data we want. For example, suppose we only care
about the data on Asia and the Americas, before and after 1990.
# before 1990
gap %>%
filter(continent == "Americas" | continent == "Asia") %>%
filter(year <= 1990) %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, Before 1990")
# after 1990
gap %>%
filter(continent == "Americas" | continent == "Asia") %>%
filter(year > 1990) %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)",
title = "Relations of Life Expectancy and Economic Development, After 1990")
Note that when we use dplyr and pipes, we use
%>% to connect lines; in ggplot, however, we use
+ instead!
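A slightly more compact way to write the same filter, sketched below, uses %in% and a single filter() call, and stores the result before plotting. The object name asia_americas_early is made up for this example; it should select the same rows as the two-step filter with | used in the first example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap <- gapminder

# Data steps are chained with %>% ...
asia_americas_early <- gap %>%
  filter(continent %in% c("Americas", "Asia"), year <= 1990)

# ... and plot layers are chained with +.
ggplot(data = asia_americas_early,
       aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)
```

Saving the filtered data first also lets you inspect it, for example with head() or nrow(), before drawing anything.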
\(~\)
(Hint: replace color with shape, and shape label values are here.)
\(~\)
library(tidyverse)
library(gapminder)
gap <- gapminder
\(~\)
Previously, we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels.
\(~\)
facet_wrap() is a useful tool to display patterns for
different groups. For example:
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_point() +
facet_wrap(~ continent)
\(~\)
If we would like to compare all five continents in a single row, we can
use ncol or nrow to set the number of columns or rows
of panels.
\(~\)
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_point() +
facet_wrap(~ continent, ncol = 5)
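Facets combine naturally with the line plot from earlier. The sketch below draws one line per country inside each continent panel, using nrow = 1, which for five panels should give the same layout as ncol = 5. The object name faceted_lines is chosen for illustration.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

faceted_lines <- ggplot(data = gap, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.4) +               # one semi-transparent line per country
  facet_wrap(~ continent, nrow = 1)      # five panels side by side

faceted_lines
```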
\(~\)
\(~\)
Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, color, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control.
\(~\)
Take the life expectancy over the years as an example:
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_point() +
scale_y_continuous(limits = c(20, 100))
\(~\)
We can set the scale for the y-axis by adding a
scale_y_continuous() layer, since lifeExp is a
continuous variable. We can modify its limits with limits
and choose which values to show with breaks.
\(~\)
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_point() +
scale_y_continuous(limits = c(20, 100), breaks = c(20, 30, 40, 50, 60, 70, 80, 90, 100))
\(~\)
We can also assign different labels to the values, by the
labels argument.
\(~\)
ggplot(data = gap, aes(x = year, y = lifeExp)) +
geom_point() +
scale_y_continuous(limits = c(20, 100), breaks = c(30, 60, 90),
labels = c("low (30)", "medium (60)", "high (90)"))
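Typing out every break by hand is error-prone. As a sketch, seq() can generate evenly spaced breaks instead; the object name y_breaks is made up here. It is also worth checking that the limits cover the data, because points outside the limits of a scale are dropped from the plot with a warning.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Evenly spaced breaks from 20 to 100 in steps of 10.
y_breaks <- seq(from = 20, to = 100, by = 10)

# Confirm that the chosen limits contain every observation.
range(gap$lifeExp)

ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() +
  scale_y_continuous(limits = c(20, 100), breaks = y_breaks)
```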
Legends are more complicated than axes.
\(~\)
\(~\)
The following sections describe the options that control these interactions.
ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
geom_point() +
theme_bw()
\(~\)
By default, a layer will only appear in the legend if the corresponding aesthetic
is mapped to a variable with aes(). You can override
whether or not a layer appears in the legend with
show.legend: FALSE prevents a layer from ever appearing
in the legend; TRUE forces it to appear when it otherwise
wouldn’t.
\(~\)
ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
geom_point(show.legend = FALSE) +
theme_bw()
\(~\)
You can also change the location of the legend with the theme()
function. The position of the legend is controlled by
the theme setting legend.position, which takes the values
“right”, “left”, “top”, “bottom”, or “none” (no legend).
\(~\)
ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
geom_point() +
theme_bw() +
theme(legend.position = "bottom")
\(~\)
Alternatively, if there’s a lot of blank space in your plot you might
want to place the legend inside the plot. You can do this by setting
legend.position to a numeric vector of length two. The
numbers represent a relative location in the panel area:
c(0, 1) is the top-left corner and c(1, 0) is
the bottom-right corner. You control which corner of the legend the
legend.position refers to with legend.justification, which is specified
in a similar way. Unfortunately positioning the legend exactly where you
want it requires a lot of trial and error.
\(~\)
ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
geom_point() +
scale_y_continuous(limits = c(0, 100)) +
theme_bw() +
theme(legend.position = c(1, 0), legend.justification = c(1, 0))
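In ggplot2 3.5.0 and later, passing a numeric vector to legend.position is deprecated in favor of legend.position = "inside" together with legend.position.inside. The sketch below picks the form that matches the installed version. The helper name legend_inside is made up for this example, and the version check is just one way to keep the code working across versions.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Choose the theme call that matches the installed ggplot2 version.
legend_inside <- if (packageVersion("ggplot2") >= "3.5.0") {
  theme(legend.position = "inside",
        legend.position.inside = c(1, 0),
        legend.justification = c(1, 0))
} else {
  theme(legend.position = c(1, 0),
        legend.justification = c(1, 0))
}

ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 100)) +
  theme_bw() +
  legend_inside
```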
\(~\)
\(~\)
\(~\)
\(~\)
This page is, in part, derived from the following sources:
R for Data Science licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0.
Rochelle Terman’s class notes for PLSC 31101: Computational Tools for Social Science.